Lab 3

Part A

Setting Up

#Loading libraries
library(spatstat)
library(sp)
library(sf)
library(tmap)
library(car)
library(elevatr)
#reading in data
data(longleaf)
data(bei)
summary(longleaf)
## Marked planar point pattern:  584 points
## Average intensity 0.0146 points per square metre
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 metres
## 
## marks are numeric, of type 'double'
## Summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    9.10   26.15   26.84   42.12   75.90 
## 
## Window: rectangle = [0, 200] x [0, 200] metres
## Window area = 40000 square metres
## Unit of length: 1 metre
head(longleaf)
## Marked planar point pattern: 6 points
## marks are numeric, of storage type  'double'
## window: rectangle = [0, 200] x [0, 200] metres
plot(longleaf)

The spatstat package for R is a package designed for spatial statistics with a heavy emphasis on point pattern analysis

?bei

Using the ?bei command in R we can find more information on the contents of the bei dataset. The bei dataset contains the positions of 3605 trees of the Beilschmiedia pendula species on Barro Colorado Island. This data set is a small portion of a much larger data set containing the locations of far more trees of multiple species.

plot(bei)

There is positive autocorrelation as there is clear evidence of clustering as seen in the above plot of the bei dataset. The location fallacy may cause issues in this lab as there may be factors of the location sampled itself that is not recorded that influences the location of these trees.

#to help find errors in helpfile
summary(bei)
## Planar point pattern:  3604 points
## Average intensity 0.007208 points per square metre
## 
## Coordinates are given to 1 decimal place
## i.e. rounded to the nearest multiple of 0.1 metres
## 
## Window: rectangle = [0, 1000] x [0, 500] metres
## Window area = 5e+05 square metres
## Unit of length: 1 metre

One discrepancy between the helpfile and the bei dataset is that the helpfile claims there are 3605 points in the dataset. When running the summary command for bei, however, there are only 3604 points in the dataset.

Standard Ellipse

#finding spatial mean of bei
x.mean.bei <- mean(bei$x) 
y.mean.bei <- mean(bei$y)

print(x.mean.bei)
## [1] 433.7792
print(y.mean.bei)
## [1] 263.5103
dataEllipse(bei$x, bei$y, levels=0.95, 
            fill=TRUE, fill.alpha=0.1,
            col= c( rgb(red=0, green=0, blue=0, alpha = 0.5),
                    "black"),    
            pch=4, cex=.5,    
            xlab="x (m)",ylab="y (m)",
#increasing limits of graph to make it readable at 2 standard deviation
            xlim=c(min(bei$x)-200,max(bei$x)+200),
            ylim=c(min(bei$y)-200,max(bei$y)+200)) 

Quadrat analysis

The first spatial correlation is looking at relationships between direct neighbors while the second order spatial correlation looks for patterns beyond just immediate neighbors. Since the bei dataset records locations of trees on an island the number of distant neighbors is limited and the number of immediate neighbors is high. Therefore the first order of spatial correlation is more likely to influence the dataset than the second.

Q <- quadratcount(bei, nx = 4, ny = 4)
#table of counts
print(Q)
##            x
## y           [0,250) [250,500) [500,750) [750,1e+03]
##   [375,500]     368       506        64         287
##   [250,375)     298       171        66         194
##   [125,250)     324        27        54         178
##   [0,125)       220       138       589         120
#table of intensities
intensity(Q)
##            x
## y            [0,250) [250,500) [500,750) [750,1e+03]
##   [375,500] 0.011776  0.016192  0.002048    0.009184
##   [250,375) 0.009536  0.005472  0.002112    0.006208
##   [125,250) 0.010368  0.000864  0.001728    0.005696
##   [0,125)   0.007040  0.004416  0.018848    0.003840
#plot of counts
plot(bei,       
     use.marks = F,  
     cex = 0.5, pch = 19,    
     main="bei quadrat count") 

plot(Q, add = T, cex = 2)  

#plot of intensity
plot(intensity(Q, image=T), main="Point intensity", las=1)
plot(bei, pch=20, cex=0.6,  add=T)

#VMR
bei.var <- var(as.vector(Q)) 
bei.mean <- mean(as.vector(Q))

bei.VMR <- bei.var / bei.mean
print(bei.VMR)
## [1] 116.9755

This data is highly clustered as we can see visually from the two plots as well as by looking at the VMR which is significantly higher than one.

#quadrat hypothesis test
quadrat.test(bei)
## 
##  Chi-squared test of CSR using quadrat counts
## 
## data:  bei
## X2 = 2009.9, df = 24, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 
## Quadrats: 5 by 5 grid of tiles

The p-value calculated is extremely low meaning it is highly unlikely that this distribution is completely random and there is likely one or more factors playing a role in where these trees are located.

Q.4.Boxes  <- quadratcount(bei, nx = 2, ny = 2) 
Q.20.Boxes  <- quadratcount(bei, nx = 5, ny = 4) 
Q.100.Boxes <- quadratcount(bei, nx = 10, ny = 10)      

#changed this to put them in a column instead of row
par(mfrow=c(row=3,col=1),mar=c(1,1,1,3)) 

plot(intensity(Q.4.Boxes, image=T), main="", las=1)

plot(bei, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=T)    
plot(Q.4.Boxes, add = T, cex = 2,col="white",font=2)               

plot(intensity(Q.20.Boxes, image=T), main="", las=1)                
plot(bei, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=T)    
plot(Q.20.Boxes, add = T, cex = 1,col="white",font=2)               

plot(intensity(Q.100.Boxes, image=T), main="", las=1)                
plot(bei, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=T)    
plot(Q.100.Boxes, add = T, cex = .5,col="white",font=2) 

#creating each table separate to make them easier to read
plot(intensity(Q.4.Boxes, image=T), main="4 Boxes", las=1)                
plot(bei, pch=20, cex=0.6, col=rgb(0,0,0,.5), add=T)    
plot(Q.4.Boxes, add = T, cex = 2,col="white",font=2)               

#creating each table separate to make them easier to read
plot(intensity(Q.20.Boxes, image=T), main="20 Boxes", las=1)                
plot(bei, pch=20, cex=0.4, col=rgb(0,0,0,.5), add=T)    
plot(Q.20.Boxes, add = T, cex = 1,col="white",font=2) 

#creating each table separate to make them easier to read
plot(intensity(Q.100.Boxes, image=T), main="100 Boxes", las=1)                
plot(bei, pch=20, cex=0.3, col=rgb(0,0,0,.5), add=T)    
plot(Q.100.Boxes, add = T, cex = .75,col="white",font=2)

By altering the number of quadrats we can see that the location of clustering in this dataset changes in a clear example of the modifiable areal unit problem where a conclusion can be altered based on the size of zones used in the analysis.

Kernel Density Estimation (KDE)

bei.diggle <- bw.diggle(bei)

plot (density(bei,sigma = bei.diggle), 
      main=paste("Bandwidth Estimation = ",round(bei.diggle,3)))

plot(bei, add = T,use.marks = F, cex = 0.1,pch=16)

It is clear from looking at the above analysis that there is clustering of this tree and there is likely a factor influencing the trees location . We cannot however determine what is the root cause of the trees spatial distribution based on the data provided. Using quadrat analysis and creating a heat-map did an excellent job of visualizing this fact however simply calculating the p-value as well as the VMR was most useful in determining whether the tree’s spatial distribution was random. Using the standard ellipse, while easy to create, does not effectively assist in determining whether a dataset is distributed in a random manner or not.

Part B

Reading in Data

market<- read.csv("Farmers_Markets_NYShort.csv")

market.sf  <- st_as_sf(market,coords=c("Longitude","Latitude"),crs=4326)


market.utm <- st_transform(market.sf,crs=26918)

Map & Plot

#creating map
tmap_mode("view") 
qtm(market.utm)
#Creating plot
 market.ppp <- as.ppp(market.utm)
 plot(market.ppp,use.marks = F,cex = 1, pch = 18)

This new dataset we are now looking at records the location and qualities of farmer’s markets throughout Upstate New York. The dataset can be found at the following website: https://agriculture.ny.gov/farmersmarkets. This dataset records attributes such as operating season, contact information, licensing information and SNAP acceptance. By creating a quick plot and map we can clearly see that there is some clustering along the Hudson River Valley from NYC to Albany and then across the Erie Canal, specifically in the cities of Syracuse, Rochester, and Buffalo. This is to be expected as these areas are the most populated in Upstate New York.

Nearest Neighbor

 market.nnd <- nndist(market)
#changed bins from 50 to 10 as it is unreadable at 50
 hist(market.nnd,br=10)

 paste( c("The average nearest neighbour of the market data is:"),
        round(mean(market.nnd),3),"km away")
## [1] "The average nearest neighbour of the market data is: 1.343 km away"
#copypasted code from lab tutorial does not work have to use "market.ppp" to get it to run
 market.obs.RRatio <- clarkevans(market.ppp)
 market.obs.RRatio
##     naive  Donnelly       cdf 
## 0.4803501 0.4716527 0.4742574
#changed variable to "market.ppp"
 clarkevans.test(market.ppp)
## 
##  Clark-Evans test
##  No edge correction
##  Z-test
## 
## data:  market.ppp
## R = 0.48035, p-value < 2.2e-16
## alternative hypothesis: two-sided

The first two code chunks show us the average distance between any given farmer’s market and the closest farmer’s market to it and a histogram visualizing the nnd values of this dataset. Following this we calculate the Clark Evans Ratio which can tell us how concentrated or distributed something is within a geographic area. An R value above 1 would suggest a dispersed distribution while an R value below 1 would signify a very concentrated dispersion. The calculated R value from the Clark Evans test is 0.48035 which signifies that NY farmer’s markets are clustered as the R value is below 1.

#changed variable to "market.ppp"
 clarkevans.test(market.ppp,
                 alternative = "clustered")
## 
##  Clark-Evans test
##  No edge correction
##  Z-test
## 
## data:  market.ppp
## R = 0.48035, p-value < 2.2e-16
## alternative hypothesis: clustered (R < 1)

Edge Effects

Edge effects in point pattern data is a concept where the boundary of a dataset may misrepresent the qualities of data points on this boundary. In the context of this dataset where only data from Upstate New York is recorded the markets on the boundary between Upstate and Downstate seem very isolated and therefore have a high value for nearest neighbor distance. However it is highly likely that there are many markets in the NYC area and the inclusion of those markets would have a noticeable impact on the data points that at the moment are on the edge of the dataset.

Locational fallacy

The locational fallacy is a common issue when working with spatial data. This happens when the location assigned to a data point is not representative of its true location and nature. This is especially true when looking at something that is not pinned down in one place such as farmer’s markets. Farmers markets are fluid entities whose occupants and locations are always shifting from season to season and even week to week. Therefore it is highly likely that there are data points in this dataset that are either partially or wholly incorrect.

Modifiable Area Unit Problem (MAUP)

This concepts relates to the scale chosen for spatial analysis can effect its representation. Choosing an inappropriate scale has the potential to grossly misrepresent data and skew results. In the concept of this dataset it does not play a major role as we are only interested in farmer’s markets in Upstate New York and our scale is the entirety of Upstate New York.

Ripley’s L-Function

#changed variable to "market.ppp"
market.lest <- Lest(market.ppp,correction = "Ripley")

market.lest.IRPsimulation <- envelope(market.ppp, 
                              Lest, correction = "Ripley", 
                              verbose = F,
                              nsim=1000, nrank=1, 
                              savepatterns = F, savefuns = F)
L.EST.Max<-(max(market.lest$iso-market.lest$r))
L.EST.Min <- 0-L.EST.Max

plot(market.lest,. - r ~ r, ylim=c(L.EST.Min,L.EST.Max))


plot(market.lest.IRPsimulation, . - r ~ r,add=T)

Each simulation is represented by a gray line. As you run these simulations in this case 1000 of them the lines continuously overlap until the resemble a grey cloud in the middle of the graph. The more simulations you run the more lines are generated leading to what looks like a larger grey cloud. The output of this graph shows us that the data is unusually clustered as it above the gray cloud at all times. This conclusion is further corroborated by the earlier Clark Evans test. The correction option in the Kest command allows you to specify which particular forms of edge corrections you would like to apply to the data when running the command. You also have the option to apply all forms of correction at the same time if needed.